
Bioinformatics Advances

Oxford University Press (OUP)

Preprints posted in the last 90 days, ranked by how well they match Bioinformatics Advances's content profile, based on 184 papers previously published here. The average preprint has a 0.16% match score for this journal, so anything above that is already an above-average fit.

1
Discovering conserved regulatory modules in predicted gene regulatory networks across species

Zhang, J.; Heath, L. S.

2026-05-16 systems biology 10.64898/2026.05.15.725337 medRxiv
Top 0.1%
26.5%

The discovery of conserved regulatory motifs across different species is a fundamental challenge in systems biology, especially considering the noisy and incomplete nature of predicted gene regulatory networks (GRNs) and the intractability of the underlying graph alignment problem. Traditional network alignment methods frequently enforce one-to-one node mappings or strict topological isomorphism, which fail to accommodate the many-to-many orthology mappings caused by evolutionary gene duplication. Consequently, strict constraints often yield highly fragmented topological islands rather than cohesive functional modules. In this work, we propose a relaxed topological alignment algorithm designed to extract conserved regulatory structures from cross-species GRNs. We formulate the discovery process as a multi-objective optimization problem that balances sequence homology, functional coherence, and a normalized topological consensus. To navigate the exponentially scaling search space, we introduce a greedy seed-and-extend heuristic bounded by a dynamic ε-stopping condition, which evaluates marginal objective gains to prevent functional dilution. We validate our algorithm using time-series transcriptomic data from Arabidopsis thaliana, Zea mays, and Sorghum bicolor focused on drought and developmental stress responses. While a strict topological baseline extracted only fragmented subgraphs limited to 51 homologous tuples, our relaxed heuristic successfully converged on a highly connected 444-tuple module. The resulting topology effectively links strictly conserved upstream transcription factors to their highly duplicated, species-specific downstream pathways. Our algorithm provides a robust, scalable computational methodology for identifying core regulatory logic across complex biological systems, facilitating the translation of conserved network architectures among multiple species.
Author summary: Identifying shared regulatory mechanisms across diverse species is essential for understanding how complex biological systems evolve and adapt. However, traditional computer algorithms struggle to align these biological networks because evolution frequently duplicates genes, breaking simple one-to-one comparisons and producing highly fragmented results. To overcome this limitation, we developed a relaxed cross-species network alignment algorithm. Instead of demanding perfectly identical network shapes, our approach dynamically balances genetic sequence similarity, network structure, and biological function. We demonstrated the performance of our algorithm using plant drought-stress networks as a case study. While strict methods only found tiny, disconnected network fragments, our algorithm uncovered a functionally coherent, interconnected regulatory module across three distinct species. We discovered that while upstream command genes remain strictly conserved, they regulate highly customized, species-specific execution pathways downstream. Ultimately, our framework provides a scalable, species-agnostic method to decode complex systems, allowing researchers to translate conserved biological logic across diverse genomes.
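
The greedy seed-and-extend idea with an ε-stopping rule described in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `seed_and_extend`, the fixed (rather than dynamic) `epsilon`, and the toy additive objective are all assumptions made for illustration, and the real objective would combine sequence homology, functional coherence, and topological consensus.

```python
from typing import Callable, Iterable, Set


def seed_and_extend(
    seed: str,
    neighbors: Callable[[str], Iterable[str]],
    objective: Callable[[Set[str]], float],
    epsilon: float,
) -> Set[str]:
    """Grow a module from `seed`, stopping when the best marginal gain < epsilon.

    Hypothetical helper: `objective` is a placeholder for the paper's
    multi-objective score, and `epsilon` is fixed here for simplicity.
    """
    module = {seed}
    score = objective(module)
    while True:
        # Frontier: nodes adjacent to the current module but not yet in it.
        frontier = {n for m in module for n in neighbors(m)} - module
        if not frontier:
            break
        # Greedy step: pick the candidate giving the highest objective value
        # (sorted first so ties are broken deterministically).
        best = max(sorted(frontier), key=lambda n: objective(module | {n}))
        new_score = objective(module | {best})
        if new_score - score < epsilon:
            break  # marginal gain too small: stop to avoid diluting the module
        module.add(best)
        score = new_score
    return module


# Toy example: an undirected graph and an additive per-node score (assumed).
GRAPH = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"]}
WEIGHT = {"a": 1.0, "b": 0.5, "c": 0.05, "d": 0.3}

result = seed_and_extend(
    "a",
    neighbors=lambda n: GRAPH[n],
    objective=lambda nodes: sum(WEIGHT[n] for n in nodes),
    epsilon=0.1,
)
```

In this toy run the low-scoring node `c` is left out because adding it would contribute less than `epsilon`, which is the "functional dilution" guard in miniature.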

2
A graph-based learning approach to predict the effects of gene perturbations on molecular phenotypes

Jin, Y.; Sverchkov, Y.; Sushkova, A.; Ohtake, M.; Emfinger, C.; Craven, M.

2026-03-23 systems biology 10.64898/2026.03.20.712202 medRxiv
Top 0.1%
19.0%

Motivation: Large-scale gene knockdown/knockout screens have been used to gain insight into a wide array of phenotypes and biological processes. However, conducting such experiments is expensive and labor-intensive. In this work, we present a general graph-based machine-learning approach that can predict the effects of gene perturbations on molecular phenotypes of interest given some measured phenotypic effects of other gene perturbations. The motivation for learning models that can predict the effects of gene perturbations is fourfold. Such models can (1) predict effects for unmeasured genes in cases in which cost or technical barriers preclude perturbing every gene, (2) prioritize unmeasured genes or sets of genes for subsequent perturbation experiments, (3) hypothesize mechanisms that underlie the relationships between the perturbed genes and their effects, and (4) generalize to other unmeasured phenotypes of interest. Results: We evaluate our approach by applying it, in conjunction with four different learning methods, to learn models for four varied phenotypes. Our empirical evaluation demonstrates that the learned models (1) show relatively high levels of predictive accuracy across the four phenotypes, (2) have better predictive accuracy than several standard baselines, (3) can often learn accurate models with small training sets, (4) benefit from having multiple sources of evidence in the input representation, and (5) can, in many cases, transfer their predictive value to other phenotypes. Availability and Implementation: The assembled datasets and source code for this work are available at: https://github.com/Craven-Biostat-Lab/graph-molecular-phenotype-prediction

3
CDS-BART: A BART-Based Foundation Model for mRNA Sequence Analysis

Jadamba, E.; Lee, S.-H.; Hong, J.; Lee, H.; Lee, S.; Shin, H.

2026-03-11 bioinformatics 10.64898/2026.03.09.710670 medRxiv
Top 0.1%
18.9%

Summary: Recent advancements in artificial intelligence (AI) have led to the development of foundation models that interpret mRNA as a language. Notable examples include CodonBERT, hydraRNA, EVO2, and Helix-mRNA. These models demonstrate significant potential as powerful tools for mRNA research. However, to the best of our knowledge, there is currently no publicly available AI model that is both easy to use and capable of analyzing mRNA sequences up to about 4 kb, a length scale typical of many therapeutic mRNAs, including those encapsulated within lipid nanoparticles (LNPs). Thus, we propose CDS-BART, a user-friendly, open-source tool that integrates SentencePiece sub-word tokenization with the denoising sequence-to-sequence training of Bidirectional and Auto-Regressive Transformers (BART). CDS-BART was pre-trained on mRNA data from nine taxonomic groups provided by the NCBI RefSeq database. This comprehensive pre-training, coupled with BART's denoising capability, ensures effective learning of codon usage, mRNA structure, evolution, and regulation. Thus, CDS-BART can ultimately deliver robust performance across a wide range of mRNA prediction tasks. Availability and Implementation: CDS-BART is released under the MIT License. The latest code is available via GitHub at https://github.com/mogam-ai/CDS-BART.

4
VaLPAS: Leveraging variation in experimental multi-omics data to elucidate protein function

Mahlich, Y.; Ross, D. H.; Monteiro, L.; McDermott, J. E.

2026-03-30 bioinformatics 10.64898/2026.03.26.712966 medRxiv
Top 0.1%
18.8%

Motivation: Despite continuing advances in sequencing and computational function determination, large parts of the studied gene, protein, and metabolite space remain functionally undetermined. Most function assignment is driven by homology searches and annotation transfer from known and extensively studied proteins but often fails to leverage available experimental omics data generated via technologies like mass spectrometry. Results: The VaLPAS (Variation-Leveraged Phenomic Association Screen) framework is available as a Python package and provides a user-friendly platform for calculation of associations between expression patterns of genes or proteins in multi-omic datasets based on various statistical and learning methods. The goal of this approach is to shed light on the functional dark matter of protein space by elucidating previously unknown functions of molecules using guilt by association with molecules of known function. We present results demonstrating the utility of VaLPAS to identify high-confidence predictions for a subset of genes/proteins of unknown function in a previously published multi-omics dataset from the oleaginous yeast, Rhodotorula toruloides. Availability: VaLPAS is written in Python. The code is hosted on GitHub (https://github.com/PNNL-Predictive-Phenomics/valpas/).

5
STRmie-HD enables interruption-aware HTT repeat genotyping and somatic mosaicism profiling across sequencing platforms

Napoli, A.; Liorni, N.; Biagini, T.; Giovannetti, A.; Squitieri, A.; Miele, L.; Urbani, A.; Caputo, V.; Gasbarrini, A.; Squitieri, F.; Mazza, T.

2026-03-25 bioinformatics 10.64898/2026.03.21.713334 medRxiv
Top 0.1%
18.1%

Short tandem repeat expansions in exon 1 of the HTT gene drive Huntington's disease (HD) pathogenesis, with disease onset and progression heavily influenced by somatic mosaicism and sequence interruptions. While sequencing technologies enable repeat sizing, many computational tools lack the resolution to capture subtle interruption motifs and allele-specific somatic variation. We present STRmie-HD, an alignment-free, de novo framework for interruption-aware genotyping and quantitative profiling of somatic mosaicism at single-read resolution. The tool parses individual reads to quantify uninterrupted CAG tract length, CCG repeat content, and critical interruption variants, including Loss of Interruption (LOI) and Duplication of Interruption (DOI). Validated across Illumina, PacBio SMRT, and Oxford Nanopore platforms, STRmie-HD demonstrates high concordance with reference genotypes and superior sensitivity in identifying rare interruption patterns that conventional tools often overlook. Furthermore, it implements somatic mosaicism metrics to characterize repeat dynamics, successfully distinguishing the higher somatic expansion burden in brain tissues compared to peripheral blood. STRmie-HD offers a comprehensive and extensible solution for high-resolution molecular characterization of HTT variation, providing a robust framework for patient stratification and genetic research in HD.
Graphical Abstract: STRmie-HD flowchart. STRmie-HD is a comprehensive analytical framework that processes sequencing reads to analyze CAG/CCG trinucleotide repeats, interruption variants, and somatic mosaicism in the HTT gene. The workflow begins with sequencing reads (FASTA/FASTQ) that can undergo optional custom processing based on the sequencing design. These reads are then fed into a regular expression-based engine (STRmie-HD) to identify CAG and CCG motifs. The identified motifs lead to the estimation of CAG/CCG alleles, visualized as distinct peaks representing different allele sizes, interruption variant assessment, and somatic mosaicism quantification. STRmie-HD produces an HTML output that wraps this information into a report.
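
To make the idea of a regular expression-based repeat engine concrete, here is a small Python sketch that measures the longest uninterrupted CAG and CCG runs in a single read. It is only an illustration under stated assumptions: the patterns, the helper `longest_repeat`, and the synthetic reads are hypothetical and are not taken from the STRmie-HD code base.

```python
import re

# Assumed patterns: one or more consecutive CAG (or CCG) triplets.
CAG_RUN = re.compile(r"(?:CAG)+")
CCG_RUN = re.compile(r"(?:CCG)+")


def longest_repeat(read: str, pattern: re.Pattern) -> int:
    """Return the number of repeat units in the longest uninterrupted run."""
    runs = pattern.findall(read.upper())
    return max((len(run) // 3 for run in runs), default=0)


# Synthetic read: five CAGs, a CAA-CAG interruption, then a CCG-rich stretch.
typical_read = "GG" + "CAG" * 5 + "CAACAG" + "CCGCCA" + "CCG" * 3 + "TT"

# Same read with the CAA interruption replaced by CAG, so the tract is pure.
uninterrupted_read = "GG" + "CAG" * 5 + "CAGCAG" + "CCGCCA" + "CCG" * 3 + "TT"

typical_cag = longest_repeat(typical_read, CAG_RUN)
typical_ccg = longest_repeat(typical_read, CCG_RUN)
uninterrupted_cag = longest_repeat(uninterrupted_read, CAG_RUN)
```

Comparing `typical_cag` with `uninterrupted_cag` shows why interruption awareness matters: the same flanking sequence yields a longer pure CAG tract once the interrupting triplet is absent.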

6
MethylCurate: Tool For Dataset Curation and Epigenetic Aging Clock Evaluation

Edwards, T. A.; Shen, L.; Long, Q.

2026-05-14 bioinformatics 10.64898/2026.05.11.723515 medRxiv
Top 0.1%
14.9%

Summary: DNA methylation datasets from public repositories such as NCBI Gene Expression Omnibus are central to the development and evaluation of epigenetic aging clocks, yet existing resources and tools do not fully resolve the bottlenecks of dataset retrieval and metadata harmonization. Current benchmarking frameworks often rely on static curated collections, support only a subset of available Gene Expression Omnibus studies, focus on specific tissues, or require substantial manual intervention when metadata fields and supplementary files are inconsistently structured across studies. We developed MethylCurate, an agentic AI framework that addresses these limitations by automating the retrieval of DNA methylation datasets from the Gene Expression Omnibus, harmonizing heterogeneous metadata, mapping datasets to a unified format, and enabling scalable evaluation of epigenetic aging clocks through an integrated, dialogue-driven workflow. Availability and Implementation: MethylCurate is implemented in Python and combines deterministic modules for Gene Expression Omnibus dataset retrieval, quality control, and clock evaluation with large language model-assisted agents for metadata extraction, metadata harmonization, and DNA methylation data parsing. Source code, documentation, and example workflows are available at: https://github.com/Travyse/methylcurate Contact: travyse.edwards@pennmedicine.upenn.edu Supplementary Information: Supplementary data are available at Bioinformatics online. Graphical Abstract: MethylCurate is an agentic-AI framework that converts user-specified NCBI Gene Expression Omnibus DNA methylation datasets into standardized metadata, beta matrices, artifacts, logs, and aging clock benchmarking outputs through automated retrieval, quality control, metadata extraction, harmonization, and evaluation workflows. Figure generated with Biorender.
Key Messages: (1) Automated curation of DNA methylation datasets from the Gene Expression Omnibus. (2) Standardized preprocessing and metadata harmonization. (3) Integrated benchmarking of epigenetic aging clocks.

7
ChironRNA: Steric Clashes Resolution in RNA Structures via E(3)-Equivariant Diffusion

Li, J.; Wang, J.; Dokholyan, N. V.

2026-03-19 biophysics 10.64898/2026.03.18.712772 medRxiv
Top 0.1%
14.6%

Due to the limited resolution of experimental data, many determined RNA structures contain physically implausible geometries, such as severe steric clashes and missing atoms. Resolving these defects during RNA structure refinement remains a fundamental challenge. Structure dictates the function, so the geometric accuracy of RNA structure is critical for understanding biological mechanisms. However, traditional algorithms for correction have limitations because of the complexity of RNA structures. We propose ChironRNA, an all-atom diffusion model with E(3)-equivariant graph neural networks to perform RNA refinement by resolving steric clashes and completing missing atoms. In ChironRNA, we adopt a hierarchical approach, including both an all-atom diffusion model and a coarse-grained diffusion model where each nucleotide is represented by a five-point representation. Our pipeline consists of two stages: a training stage and a generation stage. The diffusion model regenerates clashing nucleotide atoms step by step by removing the noise predicted by EGNN. ChironRNA achieves an 80% clash reduction on more than 80% of the test set. It performs better on structures of less than 200 nucleotides, resulting in a high percentage of cases having over 80% clash reduction rate and 100% atom reconstruction rate. Our results demonstrate that ChironRNA successfully resolves steric clashes and rebuilds missing atoms with high precision, offering a robust solution where traditional fine-tuning or enumerative approaches fail.

8
Developing SCL2205: A Protein Sequence-based Spatial Modelling Dataset for the Protein Language Model Frontier

Ouso, D.; Pollastri, G.

2026-03-10 bioinformatics 10.64898/2026.03.08.710388 medRxiv
Top 0.1%
14.4%

Deep learning (DL) has advanced computational genome annotation tasks such as protein sub-cellular localisation (SCL) prediction. Nonetheless, its potential remains underutilised, primarily because of the limited availability of high-quality reference data and suboptimal input preparation strategies. In this study, we develop and analyse a high-quality dataset derived from the latest release of the universal protein knowledgebase (UniProtKB), designed to address existing challenges and support robust DL-based SCL modelling. The dataset was constructed through extensive quality preprocessing to ensure reliability, manual label mapping to enhance the quantity and diversity of the training data, and stringent partitioning to minimise data leakage. We validated the dataset using independent test sets, achieving up to 10.8% performance improvement, measured by the area under the precision-recall curve (PR-AUC), compared to the state-of-the-art (SoTA). Furthermore, we highlighted potential performance metric inflation in existing SoTA predictors by demonstrating, for the first time, at least 4.8% training-to-testing data leakage (pre-sequence representation) when using only 10% of the training set under homology augmentation (augmentation based on sequence similarity database searches; details in Sub-section 2.1), a commonly used data augmentation strategy in DL-based SCL prediction modelling. SCL2205 will efficiently support the development of robust, trustworthy, and generalisable DL-based SCL predictors, while minimising data leakage and promoting reproducibility. It is openly available under the Creative Commons Zero (CC0 1.0) licence on DRYAD and is conveniently deployed as a package on the Python Package Index - p-scldata.

9
Uncertainty-aware graph representation learning with positive-unlabeled classification for biomarker discovery in peripheral artery disease

Ayyalasomayajula, V. S. R. K.; Senders, M. L.; Wolterink, J. M.; Yeung, K. K.

2026-05-13 systems biology 10.64898/2026.05.08.723757 medRxiv
Top 0.1%
14.3%

Peripheral artery disease (PAD) is a complex vascular disorder characterized by heterogeneous molecular mechanisms and incomplete functional annotation, limiting systematic biomarker discovery. Network-based learning approaches provide a powerful framework for disease gene prioritization; however, most existing methods produce overconfident predictions without explicitly accounting for model uncertainty or structural novelty. Here, we present an uncertainty-aware framework for PAD biomarker discovery that integrates unsupervised graph representation learning, positive-unlabeled (PU) classification, ensemble prediction, and mechanistic explainability. Node embeddings were learned using multiple unsupervised graph neural network (GNN) objectives and combined with heterogeneous classifiers to generate ensemble-averaged probability estimates and epistemic uncertainty. By jointly modeling predictive confidence and embedding-space novelty, we stratified candidates into high-confidence rediscoveries and structurally novel hypotheses under explicit uncertainty control. Across eight embedding objectives and five classifiers, ensemble aggregation produced stable, well-calibrated predictions and enabled prioritization of 100 candidate PAD-associated proteins. Probability-heavy candidates clustered tightly with known PAD proteins and were enriched for established vascular and hemostatic pathways, including extracellular matrix organization, integrin signaling, coagulation, and fibrinolysis. In contrast, novelty-heavy candidates occupied distinct embedding-space regions and partitioned into multiple coherent clusters enriched for upstream regulatory and signaling processes, including G protein-coupled receptor, ephrin receptor, kinase-driven, and NF-κB-associated pathways. Five-fold cross-validated comparison with established PU learning baselines demonstrated consistent improvement across all evaluation metrics (AUC 0.916 ± 0.019 vs. 0.821 ± 0.030 for the best baseline), and external validity was confirmed by significant enrichment of top candidates for related cardiovascular disease annotations (5.7x above background). Together, these results demonstrate that integrating uncertainty, novelty, and explainability enables calibrated and biologically grounded biomarker prioritization, with broad applicability to PAD and other complex diseases.
Author summary: Peripheral artery disease affects millions of people worldwide but remains underdiagnosed, partly because we lack reliable molecular markers to detect it early. In this study, we developed a computational framework that uses protein interaction network data to predict which proteins may be involved in PAD, even when we only know a small number of confirmed disease-associated proteins. Our approach combines graph neural network embeddings with a machine learning technique called positive-unlabeled learning, which is specifically designed for situations where you have confirmed positives but no confirmed negatives. We also quantify how confident the model is in each prediction and identify candidates that are genuinely novel compared to what is already known. Tested against established methods, our framework consistently found more known disease proteins in cross-validated evaluation. The candidates we identified map to biologically coherent pathways relevant to vascular disease, and our top predictions are enriched for proteins associated with related cardiovascular conditions, providing external validation. This work provides a principled and transparent approach to biomarker discovery that could be applied to other complex diseases with limited molecular annotations.
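
As a rough illustration of ensemble-averaged probabilities with an uncertainty estimate, the Python sketch below averages per-protein scores across several models and uses the spread between models as a simple stand-in for epistemic uncertainty. The abstract does not specify how uncertainty is computed, so the standard-deviation choice, the function name `ensemble_summary`, and the example scores are all assumptions.

```python
from statistics import mean, pstdev
from typing import Dict, Tuple


def ensemble_summary(
    probs_by_model: Dict[str, Dict[str, float]],
) -> Dict[str, Tuple[float, float]]:
    """Map each protein to (mean probability, between-model standard deviation).

    Hypothetical helper: disagreement between ensemble members is used here
    as a proxy for epistemic uncertainty.
    """
    # Only score proteins that every model in the ensemble has rated.
    shared = set.intersection(*(set(p) for p in probs_by_model.values()))
    summary = {}
    for protein in sorted(shared):
        votes = [scores[protein] for scores in probs_by_model.values()]
        summary[protein] = (mean(votes), pstdev(votes))
    return summary


# Made-up scores from two hypothetical embedding/classifier combinations.
scores = {
    "model_1": {"P1": 0.9, "P2": 0.2},
    "model_2": {"P1": 0.8, "P2": 0.6},
}
summary = ensemble_summary(scores)
```

Here `P1` gets a high mean with low spread (a confident candidate), whereas `P2` gets a lower mean with a larger spread, flagging it as a prediction the ensemble disagrees on.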

10
dAMN: a genome scale neural-mechanistic hybrid model to predict bacterial growth dynamics

Faulon, J.-L.; Dursoniah, D.; Ahavi, P.; Raynal, A.; Asin-Garcia, E.

2026-03-06 systems biology 10.64898/2026.03.04.709593 medRxiv
Top 0.1%
13.2%

Summary: This study presents dAMN, a hybrid neural-mechanistic model that integrates neural networks with genome-scale dynamic flux balance analysis (dFBA) to predict bacterial growth curves across diverse nutrient environments. dAMN uses neural networks to infer dynamic behavior from initial metabolite concentrations, while mechanistic constraints ensure stoichiometric and thermodynamic consistency based on genome-scale metabolic models. dAMN is trained on E. coli and P. putida experimental growth data from media containing various combinations of sugars, amino acids, and nucleobases, and evaluated on two test sets: one for forecasting over time and another for predicting growth dynamics on unseen media. dAMN achieved high predictive power (R² ≥ 0.9), successfully reproducing growth and substrate depletion dynamics including acetate overflow and glucose-acetate consumption shift for E. coli. An interesting innovation of dAMN is the treatment of the lag phase, enabling realistic adaptation dynamics absent from standard dFBA models. dAMN stands out for its ability to generalize across combinatorial nutrient inputs and produce full growth-curve predictions from minimal input data. Availability and implementation: The dAMN software, along with the associated models and data, is available at https://github.com/brsynth/dAMN-main-release and via DOI 10.5281/zenodo.17908125

11
GeNETop: Context-Specific Genome-Scale Constrained Models Using Network Topology, Flux Variability, and Transcriptomics

Troitino-Jordedo, D.; Mansouri, A.; Minebois, R.; Querol, A.; Remondini, D.; Balsa-Canto, E.

2026-03-18 systems biology 10.64898/2026.03.16.712013 medRxiv
Top 0.1%
12.9%

Context-specific genome-scale metabolic models are critical tools for studying cellular metabolism under dynamic conditions. However, most existing methods for deriving these models are designed for steady-state settings and may fail to preserve reactions required for transient metabolic shifts, thereby limiting their compatibility with dynamic FBA. Here, we present GeNETop, a methodology for deriving context-specific GEMs designed to preserve dynamic compatibility. GeNETop integrates flux variability analysis (FVA), network topology metrics based on the Integrated Value of Influence (IVI), and transcriptomic data to identify reactions that are both flux-flexible and structurally influential. Reactions are prioritized based on variability and maximality indices, while topology and gene expression guide further refinement, reducing dependence on fixed expression thresholds. Using batch fermentation of Saccharomyces cerevisiae as a case study, we evaluate GeNETop against established methods for context-specific metabolic reconstruction. The resulting networks remain dynamically feasible across growth phases, capture key metabolic transitions, reduce non-essential reactions, and maintain computational tractability. Overall, GeNETop enables context-specific metabolic reconstructions that are compatible with dynamic simulations while maintaining computational efficiency. By overcoming key limitations of existing approaches, the method supports a more accurate representation of time-dependent metabolic processes in biotechnology and systems biology.
Author summary: Cellular metabolism relies on complex networks of reactions to process nutrients, generate energy, and build essential compounds for biomass. Context-specific metabolic models aim to represent only the reactions active under a given condition, improving biological realism and reducing computational complexity in flux balance analysis simulations. However, metabolic activity adapts dynamically to changing environmental conditions, and reactions that are inactive at one stage may become essential at another. Many current reconstruction methods are designed for steady-state conditions and may exclude reactions that are required during metabolic transitions, thereby limiting their ability to describe dynamic behavior. Here, we introduce GeNETop, a novel approach that refines context-specific networks by integrating multiple layers of information. GeNETop identifies the most relevant reactions by considering their flexibility, importance within the network topology, and gene activity levels. In this way, the method generates biologically meaningful models that focus on metabolic pathways relevant under dynamic conditions. We tested GeNETop on yeast fermentation, a key process in food and biofuel production. The resulting models capture metabolic changes over time and enable stable dynamic simulations, supporting improved flux balance analysis of time-dependent metabolic processes.

12
MechAInistic: An LLM-guided Multi-Agent System for Reasoning over Genome-Scale Constraint-Based Metabolic Models

Loecker, J.; Pujara, N.; Bryant, W.; Puniya, B. L.; Packrisamy, P.; Hamed, A.; Helikar, T.

2026-05-13 systems biology 10.64898/2026.05.11.723319 medRxiv
Top 0.1%
12.4%

Constraint-based metabolic modeling is a powerful way to study the mechanistic basis of cellular states and disease, but effective use demands substantial computational expertise and careful coordination of multi-step analyses. We developed MechAInistic to lower this barrier, enabling researchers to ask complex biological questions in natural language. MechAInistic is a multi-agent system harnessing large language models organized around an Architect-Reviewer pattern that converts a natural-language question into an executable, model-grounded workflow and produces a structured report. It supports pathway comparison, perturbation analysis, drug-target exploration, and literature interpretation across healthy and disease paired states. We evaluated MechAInistic's therapeutic hypothesis generation using two immune-cell use cases. For rheumatoid arthritis/healthy Naive B models, it identified mitochondrial metabolic rewiring and nominated Devimistat/CPI-613 as an investigational OGDH-centered hypothesis. In CD4+ Th17 multiple sclerosis/healthy models, the workflow identified NADP-dependent isocitrate dehydrogenase as the optimal target and proposed Ivosidenib as an FDA-approved repurposing candidate.

13
GMIP-PLSR: A Nextflow Pipeline for GWAS and Multi-Omics Integration in Gene Prioritization Using PLSR

Kanchwala, M. S.; Xing, C.; Xuan, Z.

2026-04-09 bioinformatics 10.64898/2026.04.06.716845 medRxiv
Top 0.1%
12.3%

Genome-wide association studies (GWAS) have significantly advanced our understanding of complex traits and diseases, but their interpretive power remains limited due to challenges in identifying causal genes and pathways. Integrating GWAS with multi-omics data, such as gene expression, protein-protein interactions, and gene-pathway networks, has the potential to enhance biological insights and improve gene prioritization. To fulfill this potential and need, we developed the GWAS & Multi-omics Integration Pipeline (GMIP), a flexible and scalable framework that incorporates widely used tools such as PoPS, MAGMA, and benchmarker to enrich GWAS findings. However, PoPS suffers from multicollinearity in its features, which can impact performance. To overcome this, we introduce GMIP-PLSR, an extension of GMIP that uses Partial Least Squares Regression (PLSR) to manage multicollinearity effectively. We applied GMIP-PLSR across multiple GWAS datasets, demonstrating superior performance over PoPS in most cases. In a case study on NAFLD, GMIP-PLSR, using features derived from both disease-specific scRNA-seq and general PoPS features, identified gene sets with higher heritability and stronger enrichment in known NAFLD pathways, confirming its ability to enhance GWAS findings. Built on Nextflow, GMIP is computationally efficient, adaptable to diverse research environments, and provides a robust solution for gene reprioritization in post-GWAS analyses. GMIP-PLSR is available at https://github.com/mohammedmsk/GMIP.

14
DIOPT: the DRSC Integrative Ortholog Prediction Tool, 2026 update

Hu, Y.; Comjean, A.; Gao, C.; Yamamoto, S.; Mohr, S.; Perrimon, N.

2026-04-16 bioinformatics 10.64898/2026.04.15.718708 medRxiv
Top 0.1%
12.3%

Mapping orthologous proteins is a critical step for cross-species literature mining, data integration, experimental design, and more, making the ability to quickly predict orthologs across species a key tool for functional genomic studies. The DRSC Integrative Ortholog Prediction Tool (DIOPT) was initially developed in 2011 to provide a centralized portal for identifying predicted orthologs among major model organisms. By integrating results from multiple ortholog prediction algorithms, DIOPT allows users to compare predictions across methods and prioritize high-confidence ortholog relationships. Over the years, we regularly updated the underlying genome annotations and refreshed predictions from each integrated algorithm. In addition, both the number of supported species and the number of ortholog prediction algorithms incorporated into the platform have grown. The web portal has also been enhanced with new features designed to improve usability, facilitate data exploration, and support a broader range of research applications. We also developed a sister version of DIOPT tailored specifically for arthropod species; this enables researchers working with a diverse set of insects and related organisms to perform ortholog mapping and comparative analyses more effectively. Together, these developments ensure that DIOPT remains a robust and broadly useful resource for functional genomics research.

15
DualLoc: Full-parameter fine-tuning of cascaded dual transformers for protein subcellular localization prediction

Chen, Y. G.; Chung, W.-Y.; Chang, K. Y.

2026-03-30 bioinformatics 10.64898/2026.03.27.714699 medRxiv
Top 0.1%
12.2%

Accurate protein subcellular localization is essential for biological function, and mislocalization is linked to numerous diseases. While current methods like DeepLoc 2.0 employ lightweight fine-tuning of protein language models (PLMs), their ability to predict multi-compartment localization remains limited. To address this, we introduce DualLoc, a multi-label localization predictor for ten compartments. DualLoc leverages full-parameter fine-tuning of a cascaded dual-transformer architecture, built upon foundational PLMs and augmented with attention and dropout layers. We evaluated this framework using three foundational PLMs--ProtBERT, ESM-2, and ProtT5--as backbones. Cross-validation on Swiss-Prot and independent validation on the Human Protein Atlas demonstrate consistent superiority over state-of-the-art baselines. The best-performing variant, DualLoc-ProtT5, achieves 0.5872 accuracy, 0.8271 micro-F1, and 0.7811 macro-F1, with substantial gains in the Matthews correlation coefficient for the nucleus (+0.13), cell membrane (+0.13), and extracellular space (+0.07). Pointwise mutual information analysis of model outputs reveals biologically relevant compartment couplings, notably between the Golgi apparatus and endoplasmic reticulum (PMI = 0.25, P < 10^-6), accurately reflecting secretory pathway coordination. DualLoc provides both a highly accurate predictive tool and a robust framework for investigating protein multi-localization mechanisms.

Author summary: Where a protein resides within a cell determines what it does. When proteins end up in the wrong location, normal cellular function breaks down--a misplacement linked to diseases like cancer and Alzheimer's. While computational tools exist to predict these locations, accurately tracking proteins that multitask across multiple cellular compartments simultaneously remains a major challenge.
We developed DualLoc, a new approach that predicts protein locations across ten different cellular compartments, from the nucleus to the cell membrane. By training an advanced artificial intelligence model on large protein sequence databases, our method more accurately identifies where proteins go, especially in complex, multi-location scenarios. Importantly, our analysis revealed meaningful biological patterns. We found strong predictive links between compartments that work closely together, such as the Golgi apparatus and the endoplasmic reticulum--two organelles that coordinate protein processing and transport. This suggests our model captures genuine cellular logic rather than simply memorizing data. By improving how we predict protein localization, DualLoc helps researchers better understand normal cellular function and disease mechanisms. Our method is freely available to the biomedical community.
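The pointwise mutual information analysis mentioned in the DualLoc abstract can be illustrated with a small sketch. It computes PMI(a, b) = log2(P(a, b) / (P(a) P(b))) from multi-label predictions. The function `pmi_table`, the base-2 logarithm, and the toy predictions are assumptions made for illustration; the abstract does not give the authors' exact estimator or how the P value was obtained.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(label_sets):
    """Return PMI for every compartment pair that co-occurs at least once.

    `label_sets` is a list of sets, one set of predicted compartments per protein.
    """
    n = len(label_sets)
    single = Counter()   # how many proteins carry each label
    pair = Counter()     # how many proteins carry both labels of a pair
    for labels in label_sets:
        labels = set(labels)
        single.update(labels)
        pair.update(combinations(sorted(labels), 2))
    table = {}
    for (a, b), joint in pair.items():
        p_joint = joint / n
        p_indep = (single[a] / n) * (single[b] / n)
        table[(a, b)] = math.log2(p_joint / p_indep)
    return table

# Toy predictions, invented for illustration only.
predictions = [
    {"Golgi", "ER"},
    {"Golgi", "ER"},
    {"Nucleus"},
    {"Nucleus", "Cytoplasm"},
]
table = pmi_table(predictions)
print(table[("ER", "Golgi")])
```

A positive PMI means the two compartments are predicted together more often than independence would suggest; a value near zero means no coupling.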

16
Robust Random Forests for Genomic Prediction: Challenges and Remedies

Lourenco, V. M.; Ogutu, J. O.; Piepho, H.-P.

2026-04-01 bioinformatics 10.64898/2026.03.30.715203 medRxiv
Top 0.1%
12.1%

Data contamination--from recording errors to extreme outliers--can compromise statistical models by biasing predictions, inflating prediction errors, and, in severe cases, destabilizing performance in high-dimensional settings. Although contamination can affect responses and covariates, we focus on response contamination and evaluate Random Forests through simulation. Using a synthetic animal-breeding dataset, we assess robust Random Forests across several contamination scenarios and validate them on plant and animal datasets. We thereby clarify the consequences of contamination for prediction, develop a robust Random Forest framework, and evaluate its performance. We examine preprocessing or data-transformation strategies, algorithmic modifications, and hybrid approaches for robustifying Random Forests. Across these approaches, data transformation emerges as the most effective strategy, delivering the strongest performance under contamination. This strategy is simple, general, and transferable to other Machine Learning methods, offering a remedy for robust genomic prediction. In real breeding data, robust Random Forests are useful when substantial contamination, phenotypic corruption, misrecording, or train-deployment mismatch is plausible and the goal is to recover a latent signal for genomic prediction and selection; ranking-based robust Random Forests are the dependable first option, whereas weighting-based Random Forests should be used only when their weighting scheme preserves rank structure and improves prediction. Robustification is not universally necessary, but it becomes important when contamination distorts the link between observed responses and the predictive target; standard Random Forests remain the default for clean data, whereas robust Random Forests should be fitted alongside them whenever contamination is plausible, with the final choice guided by data, trait, and breeding objective. 
Author summary: Machine learning (ML) methods are widely used for prediction with high-dimensional, complex data, and supervised approaches such as Random Forests (RF) have proved effective for genomic prediction (GP) and selection. Yet their performance can be severely compromised by data contamination if the algorithms rely on classical data-driven procedures that are sensitive to atypical observations. Robustifying ML methods is therefore important both for improving predictive performance under contamination and for guiding their practical use in high-dimensional prediction problems. To address this need, we develop robust preprocessing, algorithm-level, and hybrid strategies for improving RF performance with contaminated data. Using simulated animal data, we show that ranking- and weighting-based robust RF provide the strongest overall compromise for genomic prediction and selection under contamination. Validation on several plant and animal breeding datasets further shows that the benefits of robustification are not universal, but depend on the dataset, trait, and breeding objective. Although motivated by RF, the framework we propose is general, practical, and readily transferable to other ML methods. It also offers a basis for deciding when robustness should complement standard RF rather than replace it outright.
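To make the idea of a ranking-based response transformation concrete, the sketch below replaces each phenotype by its rank and then by a normal score, so that a single extreme value can no longer dominate the training signal. The helper names and the specific rank-to-normal mapping are assumptions; the abstract does not state which transformation the authors use, and the transformed responses would still need to be passed to a Random Forest implementation of your choice.

```python
from statistics import NormalDist

def rank_transform(y):
    """Replace each response by its 1-based rank; tied values share the mean rank."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    ranks = [0.0] * len(y)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and y[order[j + 1]] == y[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def rank_to_normal_scores(ranks):
    """Map ranks to standard normal quantiles (one common rank-based choice)."""
    n = len(ranks)
    return [NormalDist().inv_cdf((r - 0.5) / n) for r in ranks]

# Toy phenotypes; 250.0 mimics a recording error.
phenotypes = [2.1, 1.9, 2.4, 250.0, 2.0]
ranks = rank_transform(phenotypes)
scores = rank_to_normal_scores(ranks)
print(ranks)
print([round(s, 3) for s in scores])
```

After the transformation, the contaminated value is simply the largest rank rather than a value two orders of magnitude away from the rest.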

17
Attentive-SPIDNA: Attention-based neural networks for population genetics

Sanchez, T.; Jobic, P.; Regan, C.; Verdu, P.; Charpiat, G.; Jay, F.

2026-04-18 evolutionary biology 10.64898/2026.04.15.718687 medRxiv
Top 0.1%
12.1%

Artificial neural networks (ANNs) have recently offered new perspectives for solving inference problems from high-dimensional data in numerous scientific fields, but it is not yet clear which architectures are best suited to genomic data. Here, we present a new ANN architecture integrating attention mechanisms to infer effective population size history from genomic data. Built upon our previous exchangeable architecture SPIDNA, Attentive-SPIDNA adds attention layers that allow computing more expressive and complex features from combinations of haplotypes. The contribution of each haplotype to the features is learned automatically and depends on its content and affinity with the other haplotypes. Likewise, we use this mechanism to automatically perform a voting scheme that aggregates predictions from different genomic regions. This new architecture outperforms approximate Bayesian computation and previously published neural networks while relying directly on raw genetic data and being invariant to haplotype permutation in the input. As a proof of concept, we use this architecture to infer the effective population size history of 54 populations from the HGDP dataset (Bergstrom et al., 2020). This application highlights the ability of the network to handle data with a varying number of haplotypes and to quickly perform predictions for datasets including numerous populations. Therefore, the proposed mechanism could be integrated into various neural networks solving population genetics tasks.
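The permutation invariance described above can be illustrated with a heavily simplified attention pooling step: each haplotype's feature vector receives a softmax weight, and the weighted average does not depend on the order of the haplotypes or on how many there are. The function `attention_pool` and its fixed `query` vector are assumptions for illustration; in Attentive-SPIDNA the weights are learned and depend on affinities between haplotypes, which this sketch does not model.

```python
import math

def attention_pool(features, query):
    """Softmax-weighted average of per-haplotype feature vectors.

    `features` is a list of equal-length vectors (one per haplotype);
    `query` is a fixed scoring vector standing in for learned parameters.
    """
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in features]
    top = max(scores)                       # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(query)
    return [
        sum(w * feat[d] for w, feat in zip(weights, features)) / total
        for d in range(dim)
    ]

# Toy haplotype features, invented for illustration only.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.5, -0.5]

pooled = attention_pool(features, query)
pooled_shuffled = attention_pool(features[::-1], query)   # same haplotypes, reversed order
pooled_fewer = attention_pool(features[:2], query)        # a different number of haplotypes
print(pooled, pooled_shuffled, pooled_fewer)
```

Because the pooling is a weighted sum over the set of haplotypes, reordering the input leaves the output unchanged up to floating-point rounding, and the output size is the same for any number of haplotypes.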

18
Correlate: A Web Application for Analyzing Gene Sets and Exploring Gene Dependencies Using CRISPR Screen Data

Deolankar, S.; Wermeling, F.

2026-04-04 bioinformatics 10.64898/2026.04.02.716070 medRxiv
Top 0.1%
10.7%

CRISPR screen data provides a valuable resource for understanding gene function and identifying potential drug targets. Here, we present Correlate, a freely accessible web application (https://correlate.cmm.se) that enables exploration of the Cancer Dependency Map (DepMap) CRISPR screen gene effects, hotspot mutations, and translocation/fusion data across more than 1,000 human cancer cell lines. The application supports two main use cases: (i) analysis of user-defined gene sets (e.g. CRISPR screen hits) to identify functionally linked genes based on correlations while providing an overview based on essentiality or user-provided screen statistics; and (ii) exploration of genes of interest in defined biological contexts, such as specific cancer types or mutational backgrounds, to generate hypotheses about gene function and dependencies. Additionally, Correlate supports experimental design by providing rapid overviews of gene essentiality and enabling the identification of cell lines with relevant mutational profiles. In contrast to knowledge-based approaches such as STRING and GSEA, which rely on prior biological annotations and curated interaction networks, Correlate identifies gene connections directly from functional CRISPR screen readouts, offering a complementary and data-driven perspective on gene network analysis. The application runs entirely in the browser, requires no installation or login, and integrates with the Green Listed v2.0 tool family for custom CRISPR screen design.

HIGHLIGHTS
■ Interactive web-based platform for bulk correlation analysis of user-defined gene sets using DepMap CRISPR screen data, requiring no installation or programming expertise.
■ Identifies functional gene relationships from CRISPR screen readouts rather than curated annotations, offering a data-driven complement to tools such as GSEA and STRING.
■ Enables contextual exploration of gene dependencies across cancer types and mutational backgrounds, supporting hypothesis generation about gene function and therapeutic targets.
■ Supports experimental design through gene essentiality overviews, mutation and fusion analysis, and cell line identification, with optional integration of user-provided statistics from CRISPR screens, proteomics, or transcriptomics analyses.
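The correlation-based linking of genes described for Correlate can be sketched as follows: given gene-effect scores across cell lines, rank the other genes by the strength of their correlation with a query gene. The use of Pearson correlation, the function names, and the toy scores are assumptions for illustration; the abstract says only that links are "based on correlations" and does not specify the statistic.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def top_correlated(query, gene_effects, k=3):
    """Return up to k genes ranked by absolute correlation with `query`."""
    scores = {
        gene: pearson(gene_effects[query], values)
        for gene, values in gene_effects.items()
        if gene != query
    }
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]

# Hypothetical gene-effect scores across five cell lines (not DepMap data).
gene_effects = {
    "GENE_A": [-1.0, -0.8, -0.1, 0.0, -0.9],
    "GENE_B": [-0.9, -0.7, 0.0, 0.1, -0.8],   # tracks GENE_A closely
    "GENE_C": [0.2, -0.3, 0.1, -0.2, 0.0],    # largely unrelated
}
hits = top_correlated("GENE_A", gene_effects)
for gene, r in hits:
    print(gene, round(r, 3))
```

Restricting the cell lines to a cancer type or mutational background before computing the correlations would correspond to the contextual exploration use case.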

19
Exon Targeted Retrieval and Classification Toolbox (ExTRaCT): a gene search pipeline to find APOBEC3 Z-domains in novel bat genomes

Delamonica, B.; Bat1K 21-Families Group, ; Larijani, M.; MacCarthy, T.; Davalos, L. M.

2026-03-18 genomics 10.64898/2026.03.15.711917 medRxiv
Top 0.1%
10.5%

Motivation: Several computational gene search tools exist to identify and annotate an ever-growing body of newly sequenced genomes of different species. Many annotation tools, however, fall short when the target species diverges from well-studied model organisms, and when searching for short genes with multiple copies.
Results: We have developed the Exon Targeted Retrieval and Classification Toolbox, ExTRaCT, an automated pipeline to identify any gene exon with conserved structure in novel species genome assemblies. In the use cases presented here, we applied our search tool to 102 bat genomes to find APOBEC3 gene family members. We show that our homolog search algorithm is efficient (run time average of 5 hours for over 100 genomes), works well with reference sequences distantly related to the target (1 out of 498 misclassifications, 0 false positives and 2 false negatives), and is easy to use. As genomic sequencing becomes faster and more accessible, ExTRaCT has downstream applications in phylogenetic, biochemical and genomic studies. It is a simple computational tool that provides a solution to target gene identification, requiring neither whole-genome-assembly annotations nor prior knowledge of closely related species.
Availability: https://doi.org/10.5281/zenodo.15769018
Contact: Brenda.delamonica@stonybrook.edu
Supplementary information: Supplementary data are available at Bioinformatics online.

20
SELFormerMM: multimodal molecular representation learning via SELFIES, structure, text, and knowledge graph integration

Ulusoy, E.; Bostanci, S.; Deniz, B. E.; Dogan, T.

2026-03-19 bioinformatics 10.64898/2026.03.17.712340 medRxiv
Top 0.1%
10.4%

Motivation: Molecular representation learning is central to computational drug discovery. However, most existing models rely on single-modality inputs, such as molecular sequences or graphs, which capture only limited aspects of molecular behaviour. Yet unifying these modalities with complementary resources such as textual descriptions and biological interaction networks into a coherent multimodal framework remains non-trivial, hindering more informative and biologically grounded representations.
Results: We introduce SELFormerMM, a multimodal molecular representation learning framework that integrates SELFIES notations with structural graphs, textual descriptions, and knowledge graph-derived biological interaction data. By aligning these heterogeneous views, SELFormerMM effectively captures complementary signals that unimodal approaches often overlook. Our performance evaluation shows that SELFormerMM outperforms structure-, sequence-, and knowledge-based models on multiple molecular property prediction tasks. Ablation analyses further indicate that effective cross-modal alignment and modality coverage improve the model's ability to exploit complementary information. Overall, integrating SELFIES with structural, textual, and biological context enables richer molecular representations and provides a promising framework for hypothesis-driven drug discovery.
Availability: SELFormerMM is available as a programmatic tool, together with datasets, pretrained models, and precomputed embeddings at https://github.com/HUBioDataLab/SELFormerMM.
Contact: tuncadogan@gmail.com